
Avoid writing tabular files within pipelines. #29

Open · wants to merge 1 commit into metadata_arguments

Conversation

lkitching
Contributor

Add implementations of the csv2rdf RowSource protocol which allow
transformed versions of pipeline input files to be passed directly.

The RowSource protocol represents tabular resources as a logical
sequence of records, each containing the source row number and parsed
data cells. Pipelines previously wrote transformed versions of the
input files to disk so they could be passed to the CSVW process.
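
For illustration, a minimal sketch of what such a protocol might look like in Clojure (the names and record shape here are assumptions, not the actual csv2rdf definitions):

```clojure
;; Hypothetical sketch of a row-source protocol as described above.
;; The real protocol lives in csv2rdf and may differ in names and shape.
(defprotocol RowSource
  (row-records [this]
    "Returns a logical sequence of row records, each containing the
    source row number and the parsed data cells."))

;; A row record might look like:
;; {:source-row-number 2
;;  :cells ["2017" "K02000001" "67.3"]}
```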

Add implementations of the RowSource protocol which allow the
transformation process to be done in memory and present the
transformed row records directly to the CSVW process.
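
A hedged sketch of what an in-memory implementation could look like, assuming the hypothetical protocol above:

```clojure
;; Hypothetical in-memory implementation: wraps an already-transformed
;; sequence of cell vectors so no intermediate file needs to be written.
(defrecord MemoryRowSource [rows]
  RowSource
  (row-records [_this]
    (map-indexed (fn [idx cells]
                   {:source-row-number (inc idx)
                    :cells cells})
                 rows)))

;; Usage: feed transformed rows straight to the CSVW process instead of
;; round-tripping them through a temporary CSV file.
(def source (->MemoryRowSource [["2017" "K02000001" "67.3"]
                                ["2018" "K02000001" "68.1"]]))
(row-records source)
;; => ({:source-row-number 1, :cells ["2017" "K02000001" "67.3"]}
;;     {:source-row-number 2, :cells ["2018" "K02000001" "68.1"]})
```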

The number of component specifications derived within cube-pipeline
is expected to be quite small. Load these into memory and add
a RowSource implementation which returns the corresponding tabular
rows to csv2rdf.
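
For example (again a hypothetical sketch reusing the MemoryRowSource above; the column set is invented for illustration):

```clojure
;; Component specifications are few, so hold them in memory and expose
;; them as tabular rows via the same row-source abstraction.
(defn component-specs->row-source
  [specs]
  (->MemoryRowSource
    (mapv (fn [{:keys [label component-type codelist]}]
            [label component-type codelist])
          specs)))
```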

Update the tests which check the format of the intermediate
transformed data to use the transformed row sources.

@Robsteranium
Contributor

I guess this complements #27? Whereas that deals with the metadata, this deals with the tables themselves?

Sorry this hasn't been reviewed sooner @lkitching.

Is it still valid? I guess we'll need to update it to resolve merge conflicts. I wonder if there's any interaction now with #120?

Ideally we'll reach the point where we can run as either a) csv->csvw or b) csv->rdf, to support interop/scrutability and overall efficiency respectively.

@lkitching
Contributor Author

@Robsteranium - I'm not sure we want to use this any more. We no longer use #27 within #120 either, since we always write the CSVW to disk. We could resurrect this approach in future since the infrastructure still exists within csv2rdf, but it's probably more effort than it's worth for now.

@Robsteranium
Contributor

Robsteranium commented Apr 21, 2020

There was a reason for doing this though, wasn't there... Was it an OOME, or was it just faster without the I/O? I can't remember!

Let's leave the PR and branch open in case we want to reintroduce it later.
